Goto

Collaborating Authors

 Northern Province


From Phonemes to Meaning: Evaluating Large Language Models on Tamil

Varsha, Jeyarajalingam, Velayuthan, Menan, Karunakaran, Sumirtha, Nivethiga, Rasan, Sarveswaran, Kengatharaiyer

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown strong generalization across tasks in high-resource languages; however, their linguistic competence in low-resource and morphologically rich languages such as Tamil remains largely unexplored. Existing multilingual benchmarks often rely on translated English datasets, failing to capture the linguistic and cultural nuances of the target language. To address this gap, we introduce ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark manually curated using 820 questions from Sri Lankan school-level Tamil subject examination papers. Each question is annotated by trained linguists under five linguistic categories and a factual knowledge category, spanning Grades 1--13 to ensure broad linguistic coverage. We evaluate both closed-source and open-source LLMs using a standardized evaluation framework. Our results show that Gemini 2.5 achieves the highest overall performance, while open-source models lag behind, highlighting the gap in linguistic grounding. Category- and grade-wise analyses reveal that all models perform well on lower-grade questions but show a clear decline as linguistic complexity increases. Further, no strong correlation is observed between a model's overall performance and its ability to identify linguistic categories, suggesting that performance may be driven by exposure rather than genuine understanding.


Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Chang, Tyler A., Arnett, Catherine, Eldesokey, Abdelrahman, Sadallah, Abdelrahman, Kashar, Abeer, Daud, Abolade, Olanihun, Abosede Grace, Mohammed, Adamu Labaran, Praise, Adeyemi, Sharma, Adhikarinayum Meerajita, Gupta, Aditi, Iyigun, Afitab, Simplício, Afonso, Essouaied, Ahmed, Chorana, Aicha, Eppa, Akhil, Oladipo, Akintunde, Ramesh, Akshay, Dorkin, Aleksei, Kondoro, Alfred Malengo, Aji, Alham Fikri, Çetintaş, Ali Eren, Hanbury, Allan, Dembele, Alou, Niksarli, Alp, Arroyo, Álvaro, Bajand, Amin, Khanna, Amol, Chkhaidze, Ana, Condez, Ana, Mkhonto, Andiswa, Hoblitzell, Andrew, Tran, Andrew, Poulis, Angelos, Majumder, Anirban, Vacalopoulou, Anna, Wong, Annette Kuuipolani Kanahele, Simonsen, Annika, Kovalev, Anton, S, Ashvanth., Lana, Ayodeji Joseph, Kinay, Barkin, Alhafni, Bashar, Busole, Benedict Cibalinda, Ghanem, Bernard, Nathani, Bharti, Đurić, Biljana Stojanovska, Agbonile, Bola, Bergsson, Bragi, Fischer, Bruce Torres, Tutar, Burak, Çınar, Burcu Alakuş, Kane, Cade J. Kanoniakapueo, Udomcharoenchaikit, Can, Arnett, Catherine, Helwe, Chadi, Nerella, Chaithra Reddy, Liu, Chen Cecilia, Nwokolo, Chiamaka Glory, España-Bonet, Cristina, Amol, Cynthia, Lee, DaeYeop, Arad, Dana, Dzenhaliou, Daniil, Pugacheva, Daria, Choi, Dasol, Abolade, Daud, Liu, David, Semedo, David, Popoola, Deborah, Mataciunas, Deividas, Nyaboke, Delphine, Kumar, Dhyuthy Krishna, Glória-Silva, Diogo, Tavares, Diogo, Goyal, Divyanshu, Lee, DongGeon, Anajemba, Ebele Nwamaka, Grace, Egonu Ngozi, Mickel, Elena, Tutubalina, Elena, Herranen, Elias, Anand, Emile, Habumuremyi, Emmanuel, Ajiboye, Emuobonuvie Maria, Yulianrifat, Eryawan Presma, Adenuga, Esther, Rudnicka, Ewa, Itiola, Faith Olabisi, Butt, Faran Taimoor, Thekkekara, Fathima, Haouari, Fatima, Tjiaranata, Filbert Aurelian, Laakom, Firas, Grasso, Francesca, Orabona, Francesco, Periti, Francesco, Solomon, Gbenga Kayode, Ngo, Gia Nghia, Udhehdhe-oze, Gloria, Martins, Gonçalo, Challagolla, Gopi Naga Sai Ram, Son, Guijin, Abdykadyrova, Gulnaz, Einarsson, Hafsteinn, Hu, Hai, Saffari, Hamidreza, Zaidi, Hamza, Zhang, Haopeng, Shairah, Harethah Abu, Vuong, Harry, Kuulmets, Hele-Andra, Bouamor, Houda, Yu, Hwanjo, Debess, Iben Nyholm, Deveci, İbrahim Ethem, Hanif, Ikhlasul Akmal, Cho, Ikhyun, Calvo, Inês, Vieira, Inês, Manzi, Isaac, Daud, Ismail, Itzhak, Itay, Iuliia, null, Alekseenko, null, Belashkin, Ivan, Spada, Ivan, Zhelyazkov, Ivan, Brinton, Jacob, Isbarov, Jafar, Čibej, Jaka, Čuhel, Jan, Kocoń, Jan, Krito, Jauza Akbar, Purbey, Jebish, Mickel, Jennifer, Za, Jennifer, Kunz, Jenny, Jeong, Jihae, Dávalos, Jimena Tena, Lee, Jinu, Magalhães, João, Yi, John, Kim, Jongin, Chataignon, Joseph, Imperial, Joseph Marvin, Thevakumar, Jubeerathan, Land, Judith, Jiang, Junchen, Kim, Jungwhan, Sirts, Kairit, R, Kamesh, V, Kamesh, Tshinu, Kanda Patrick, Kukk, Kätriin, Ponkshe, Kaustubh, Huseynova, Kavsar, He, Ke, Buchanan, Kelly, Sarveswaran, Kengatharaiyer, Zaman, Kerem, Mrini, Khalil, Kyars, Kian, Kruusmaa, Krister, Chouhan, Kusum, Krishnakumar, Lainitha, Sánchez, Laura Castro, Moscoso, Laura Porrino, Choshen, Leshem, Sencan, Levent, Øvrelid, Lilja, Alazraki, Lisa, Ehimen-Ugbede, Lovina, Thevakumar, Luheerathan, Thavarasa, Luxshan, Malik, Mahnoor, Keita, Mamadou K., Jangid, Mansi, De Santis, Marco, García, Marcos, Suppa, Marek, D'Ciofalo, Mariam, Ojastu, Marii, Sikander, Maryam, Narayan, Mausami, Skandalis, Maximos, Mehak, Mehak, Bozkurt, Mehmet İlteriş, Workie, Melaku Bayu, Velayuthan, Menan, Leventhal, Michael, Marcińczuk, Michał, Potočnjak, Mirna, Shafiei, Mohammadamin, Sharma, Mridul, Indoria, Mrityunjaya, Habibi, Muhammad Ravi Shulthan, Kolić, Murat, Galant, Nada, Permpredanun, Naphat, Maugin, Narada, Corrêa, Nicholas Kluge, Ljubešić, Nikola, Thomas, Nirmal, de Silva, Nisansa, Joshi, Nisheeth, Ponkshe, Nitish, Habash, Nizar, Udeze, Nneoma C., Thomas, Noel, Ligeti-Nagy, Noémi, Coulibaly, Nouhoum, Faustin, Nsengiyumva, Buliaminu, Odunayo Kareemat, Ogundepo, Odunayo, Fejiro, Oghojafor Godswill, Funmilola, Ogundipe Blessing, God'spraise, Okechukwu, Samuel, Olanrewaju, Oluwaseun, Olaoye Deborah, Akindejoye, Olasoji, Popova, Olga, Snissarenko, Olga, Chiemezie, Onyinye Anulika, Kinay, Orkun, Tursun, Osman, Moses, Owoeye Tobiloba, Joshua, Oyelade Oluwafemi, Fiyinfoluwa, Oyesanmi, Gamallo, Pablo, Fernández, Pablo Rodríguez, Arora, Palak, Valente, Pedro, Rupnik, Peter, Ekiugbo, Philip Oghenesuowho, Sahoo, Pramit, Prokopidis, Prokopis, Niau-Puhipau, Pua, Yahya, Quadri, Mignone, Rachele, Singhal, Raghav, Kadiyala, Ram Mohan Rao, Merx, Raphael, Afolayan, Rapheal, Rajalakshmi, Ratnavel, Ghosh, Rishav, Oji, Romina, Solis, Ron Kekeha, Guerra, Rui, Zawar, Rushikesh, Bashir, Sa'ad Nasir, Alzaabi, Saeed, Sandeep, Sahil, Batchu, Sai Pavan, Kantareddy, SaiSandeep, Pranida, Salsabila Zahirah, Buchanan, Sam, Rutunda, Samuel, Land, Sander, Sulollari, Sarah, Ali, Sardar, Sapkota, Saroj, Tautvaisas, Saulius, Sen, Sayambhu, Banerjee, Sayantani, Diarra, Sebastien, M, SenthilNathan., Lee, Sewoong, Shah, Shaan, Venkitachalam, Shankar, Djurabaeva, Sharifa, Ibejih, Sharon, Dutta, Shivanya Shomir, Gupta, Siddhant, Suárez, Silvia Paniagua, Ahmadi, Sina, Sukumar, Sivasuthan, Song, Siyuan, A., Snegha, Sofianopoulos, Sokratis, Simon, Sona Elza, Benčina, Sonja, Gvasalia, Sophie, More, Sphurti Kirit, Dragazis, Spyros, Kaufhold, Stephan P., S, Suba., AlRashed, Sultan, Ranathunga, Surangika, Someya, Taiga, Pungeršek, Taja Kuzman, Haklay, Tal, Jibril, Tasi'u, Aoyama, Tatsuya, Abashidze, Tea, Cruz, Terenz Jomar Dela, Blevins, Terra, Nikas, Themistoklis, Idoko, Theresa Dora, Do, Thu Mai, Chubakov, Tilek, Gargiani, Tommaso, Rathore, Uma, Johannesen, Uni, Ugwu, Uwuma Doris, Putra, Vallerie Alexandra, Kumar, Vanya Bannihatti, Jeyarajalingam, Varsha, Arzt, Varvara, Nedumpozhimana, Vasudevan, Ondrejova, Viktoria, Horbik, Viktoryia, Kummitha, Vishnu Vardhan Reddy, Dinić, Vuk, Sewunetie, Walelign Tewabe, Wu, Winston, Zhao, Xiaojing, Diarra, Yacouba, Nikankin, Yaniv, Mathur, Yash, Chen, Yixi, Li, Yiyuan, Xavier, Yolanda, Belinkov, Yonatan, Abayomi, Yusuf Ismail, Alyafeai, Zaid, Shan, Zhengyang, Tam, Zhi Rui, Tang, Zilu, Nadova, Zuzana, Abbasi, Baber, Biderman, Stella, Stap, David, Ataman, Duygu, Schmidt, Fabian, Gonen, Hila, Wang, Jiayi, Adelani, David Ifeoluwa

arXiv.org Artificial Intelligence

To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.


Semi-Supervised Learning with Online Knowledge Distillation for Skin Lesion Classification

Manivannan, Siyamalan

arXiv.org Artificial Intelligence

Deep Learning has emerged as a promising approach for skin lesion analysis. However, existing methods mostly rely on fully supervised learning, requiring extensive labeled data, which is challenging and costly to obtain. To alleviate this annotation burden, this study introduces a novel semi-supervised deep learning approach that integrates ensemble learning with online knowledge distillation for enhanced skin lesion classification. Our methodology involves training an ensemble of convolutional neural network models, using online knowledge distillation to transfer insights from the ensemble to its members. This process aims to enhance the performance of each model within the ensemble, thereby elevating the overall performance of the ensemble itself. Post-training, any individual model within the ensemble can be deployed at test time, as each member is trained to deliver comparable performance to the ensemble. This is particularly beneficial in resource-constrained environments. Experimental results demonstrate that the knowledge-distilled individual model performs better than independently trained models. Our approach demonstrates superior performance on both the \emph{International Skin Imaging Collaboration} 2018 and 2019 public benchmark datasets, surpassing current state-of-the-art results. By leveraging ensemble learning and online knowledge distillation, our method reduces the need for extensive labeled data while providing a more resource-efficient solution for skin lesion classification in real-world scenarios.


Building Tamil Treebanks

Sarveswaran, Kengatharaiyer

arXiv.org Artificial Intelligence

Treebanks are important linguistic resources, which are structured and annotated corpora with rich linguistic annotations. These resources are used in Natural Language Processing (NLP) applications, supporting linguistic analyses, and are essential for training and evaluating various computational models. This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques. Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information. Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism. Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDpipe, and UUParser, facilitate the automated annotation of large datasets but depend on the availability of quality annotated data, cross-linguistic training resources, and computational power. The paper discusses the challenges encountered in building Tamil treebanks, including issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators. Despite these challenges, the development of Tamil treebanks is essential for advancing linguistic research and improving NLP tools for Tamil.


Egalitarian Language Representation in Language Models: It All Begins with Tokenizers

Velayuthan, Menan, Sarveswaran, Kengatharaiyer

arXiv.org Artificial Intelligence

Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.


Dynamic Subgoal based Path Formation and Task Allocation: A NeuroFleets Approach to Scalable Swarm Robotics

Peter, Robinroy, Ratnabala, Lavanya, Charles, Eugene Yugarajah Andrew, Tsetserukou, Dzmitry

arXiv.org Artificial Intelligence

This paper addresses the challenges of exploration and navigation in unknown environments from the perspective of evolutionary swarm robotics. A key focus is on path formation, which is essential for enabling cooperative swarm robots to navigate effectively. We designed the task allocation and path formation process based on a finite state machine, ensuring systematic decision-making and efficient state transitions. The approach is decentralized, allowing each robot to make decisions independently based on local information, which enhances scalability and robustness. We present a novel subgoal-based path formation method that establishes paths between locations by leveraging visually connected subgoals. Simulation experiments conducted in the Argos simulator show that this method successfully forms paths in the majority of trials. However, inter-collision (traffic) among numerous robots during path formation can negatively impact performance. To address this issue, we propose a task allocation strategy that uses local communication protocols and light signal-based communication to manage robot deployment. This strategy assesses the distance between points and determines the optimal number of robots needed for the path formation task, thereby reducing unnecessary exploration and traffic congestion. The performance of both the subgoal-based path formation method and the task allocation strategy is evaluated by comparing the path length, time, and resource usage against the A* algorithm. Simulation results demonstrate the effectiveness of our approach, highlighting its scalability, robustness, and fault tolerance.


Tamil Language Computing: the Present and the Future

Sarveswaran, Kengatharaiyer

arXiv.org Artificial Intelligence

This paper delves into the text processing aspects of Language Computing, which enables computers to understand, interpret, and generate human language. Focusing on tasks such as speech recognition, machine translation, sentiment analysis, text summarization, and language modelling, language computing integrates disciplines including linguistics, computer science, and cognitive psychology to create meaningful human-computer interactions. Recent advancements in deep learning have made computers more accessible and capable of independent learning and adaptation. In examining the landscape of language computing, the paper emphasises foundational work like encoding, where Tamil transitioned from ASCII to Unicode, enhancing digital communication. It discusses the development of computational resources, including raw data, dictionaries, glossaries, annotated data, and computational grammars, necessary for effective language processing. The challenges of linguistic annotation, the creation of treebanks, and the training of large language models are also covered, emphasising the need for high-quality, annotated data and advanced language models. The paper underscores the importance of building practical applications for languages like Tamil to address everyday communication needs, highlighting gaps in current technology. It calls for increased research collaboration, digitization of historical texts, and fostering digital usage to ensure the comprehensive development of Tamil language processing, ultimately enhancing global communication and access to digital services.


Morphology and Syntax of the Tamil Language

Sarveswaran, Kengatharaiyer

arXiv.org Artificial Intelligence

This paper provides an overview of the morphology and syntax of the Tamil language, focusing on its contemporary usage. The paper also highlights the complexity and richness of Tamil in terms of its morphological and syntactic features, which will be useful for linguists analysing the language and conducting comparative studies. In addition, the paper will be useful for those developing computational resources for the Tamil language. It is proven as a rule-based morphological analyser cum generator and a computational grammar for Tamil have already been developed based on this paper. To enhance accessibility for a broader audience, the analysis is conducted without relying on any specific grammatical formalism.

  Country:
  Genre: Overview (0.68)
  Industry: Education (0.67)

Evolutionary Swarm Robotics: Dynamic Subgoal-Based Path Formation and Task Allocation for Exploration and Navigation in Unknown Environments

Ratnabala, Lavanya, Peter, Robinroy, Charles, E. Y. A.

arXiv.org Artificial Intelligence

This research paper addresses the challenges of exploration and navigation in unknown environments from an evolutionary swarm robotics perspective. Path formation plays a crucial role in enabling cooperative swarm robots to accomplish these tasks. The paper presents a method called the sub-goal-based path formation, which establishes a path between two different locations by exploiting visually connected sub-goals. Simulation experiments conducted in the Argos simulator demonstrate the successful formation of paths in the majority of trials. Furthermore, the paper tackles the problem of inter-collision (traffic) among a large number of robots engaged in path formation, which negatively impacts the performance of the sub-goal-based method. To mitigate this issue, a task allocation strategy is proposed, leveraging local communication protocols and light signal-based communication. The strategy evaluates the distance between points and determines the required number of robots for the path formation task, reducing unwanted exploration and traffic congestion. The performance of the sub-goal-based path formation and task allocation strategy is evaluated by comparing path length, time, and resource reduction against the A* algorithm. The simulation experiments demonstrate promising results, showcasing the scalability, robustness, and fault tolerance characteristics of the proposed approach.


BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models

Leong, Wei Qi, Ngui, Jian Gang, Susanto, Yosephine, Rengarajan, Hamsawardhini, Sarveswaran, Kengatharaiyer, Tjhi, William Chandra

arXiv.org Artificial Intelligence

The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these benchmarks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) a NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future.